spec-047 §4.9: ARM64 post-Phase-4 micro perf capture (LAPTOP-4MEP83VI)#465

Merged
codemonkeychris merged 2 commits into main from spec-047-phase4-arm64-perf-capture
May 30, 2026

@codemonkeychris
Collaborator

Summary

Captures the post-Phase-4 (V1-default) micro perf state of PerfBench.ControlModel (M1–M13) on LAPTOP-4MEP83VI — the exact ARM64 baseline box that spec-047 §4.9 was blocked on — and compares it against the original 2026-05-25-arm64 baseline already in the repo.

Adds under docs/specs/047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/:

  • raw JSONL (perfbench-controlmodel-{m1-m8,m9,m10-m13}.jsonl)
  • aggregator-out/ canonical tables (matches the baseline dir shape)
  • RESULTS.md — the cross-baseline interpretation
  • analyze.py — reproducible per-render comparison

Run params match the baseline exactly: ARM64-native / Release / .NET 10.0.8, reps=5, iters M1–M8 @5000 / M9 @2000 / M10–M13 @1000. 195 rows, 0 errors, 0 excluded by the §15.5 env-metadata gate.
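The per-render comparison that analyze.py performs can be sketched roughly as follows. This is a minimal illustration, not the script in the PR: the `benchId` and `variant` keys come from the script's docstring, while the `allocBytes` and `iters` field names and the sample rows are assumptions made for the example.

```python
import json
from collections import defaultdict
from statistics import mean

# Baseline 'ReactorV2' and current 'Reactor' share a code lineage, so both
# are folded into one label before comparing.
VARIANT_ALIASES = {"ReactorV2": "Reactor"}


def per_render_alloc(jsonl_lines):
    """Return {(benchId, variant): mean bytes per render across reps}.

    Assumes each row has hypothetical 'allocBytes' (total for the rep) and
    'iters' (renders in the rep) fields; the real schema may differ.
    """
    samples = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        variant = VARIANT_ALIASES.get(row["variant"], row["variant"])
        samples[(row["benchId"], variant)].append(row["allocBytes"] / row["iters"])
    return {key: mean(vals) for key, vals in samples.items()}


def percent_delta(current, baseline):
    """Signed percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


# Made-up rows purely to exercise the functions.
baseline_rows = [
    '{"benchId": "M9", "variant": "ReactorV2", "allocBytes": 4000, "iters": 2}',
    '{"benchId": "M9", "variant": "ReactorV2", "allocBytes": 4200, "iters": 2}',
]
current_rows = [
    '{"benchId": "M9", "variant": "Reactor", "allocBytes": 2050, "iters": 2}',
]
base = per_render_alloc(baseline_rows)
cur = per_render_alloc(current_rows)
delta = percent_delta(cur[("M9", "Reactor")], base[("M9", "Reactor")])
print(f"M9 Reactor: {delta:+.1f}%")
```

Averaging per-render values across reps before taking the delta keeps the comparison independent of the differing iteration counts (5000, 2000, 1000) used for the bench groups.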

Two interpretation notes

  1. ReactorToday is now identical to Reactor. §4.5 deleted the legacy dispatch switch, so the harness's "Today" variant runs the same V1 path. The real comparison is the current Reactor vs. the baseline's ReactorV2/ReactorToday columns (computed in RESULTS.md/analyze.py).
  2. Allocation is valid; timing is not. Managed allocation is deterministic (this run's Direct alloc matches the baseline byte-for-byte). The ns axis is environment-throttled: Direct (zero Reactor code) runs 60–140% slower than in the baseline run, so cross-baseline timing is disregarded.
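The reasoning in note 2 amounts to using the Direct variant as a control: it contains no Reactor code, so any change in its numbers between captures reflects the environment rather than the code under test. A minimal sketch of that check, assuming a hypothetical per-bench dictionary shape and a 10% timing tolerance that is not taken from the spec:

```python
def axis_validity(direct_baseline, direct_current, ns_tolerance_pct=10.0):
    """Decide which axes are comparable across two captures.

    Each argument maps benchId -> {"alloc": bytes/render, "ns": ns/render}
    for the Direct (control) variant. The dict shape and the tolerance are
    illustrative assumptions.
    """
    shared = sorted(set(direct_baseline) & set(direct_current))
    # Allocation is trusted only if the control matches exactly.
    alloc_valid = all(
        direct_baseline[b]["alloc"] == direct_current[b]["alloc"] for b in shared
    )
    # Timing is trusted only if the control barely moved.
    ns_drift_pct = {
        b: (direct_current[b]["ns"] - direct_baseline[b]["ns"])
        / direct_baseline[b]["ns"] * 100.0
        for b in shared
    }
    timing_valid = all(abs(d) <= ns_tolerance_pct for d in ns_drift_pct.values())
    return alloc_valid, timing_valid, ns_drift_pct


# Invented numbers: identical allocation, control timing 60% and 140% slower.
baseline = {"M1": {"alloc": 96, "ns": 100.0}, "M2": {"alloc": 256, "ns": 200.0}}
current = {"M1": {"alloc": 96, "ns": 160.0}, "M2": {"alloc": 256, "ns": 480.0}}
alloc_ok, timing_ok, drift = axis_validity(baseline, current)
print(alloc_ok, timing_ok, drift)
```

With these invented inputs the allocation axis passes and the timing axis fails, which mirrors the situation described for this capture.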

Allocation findings (the valid axis)

  • ✅ Most of the V1 path held/improved vs baseline: M2 −5%, M3 −6%, M5 −12%, M8 −15%, M10 −14%, M9 −41% (keyed list).
  • M1 +20% (1,289 vs 1,071 B/render) and M12 +17% — real, deterministic regressions. The leanest leaf (M1) got heavier despite the §4.4 bucketing.
  • §11.6 byte gates: M3 PASS; M1 (3.2× the 407-byte target) and M2 (2.4× the 1,520-byte target) FAIL.
  • Confirms the spec's KD-3 trigger ("fold the M1 binder check if M1 is still over budget") — M1 is over budget.
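The M1 figures quoted above can be re-derived from the per-render values. A small worked check, using only numbers stated in this description (1,289 and 1,071 B/render, and the 407-byte gate); the function names are illustrative:

```python
def pct_change(current, baseline):
    """Signed percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


def gate_ratio(current, target):
    """How many times the target the current value is; <= 1.0 passes."""
    return current / target


m1_current, m1_baseline, m1_gate = 1289, 1071, 407

print(f"M1 vs baseline: {pct_change(m1_current, m1_baseline):+.0f}%")  # +20%
print(f"M1 vs gate: {gate_ratio(m1_current, m1_gate):.1f}x")           # 3.2x
print("M1 gate passes:", gate_ratio(m1_current, m1_gate) <= 1.0)       # False
```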

⛔ Not a §4.9 ratification sign-off

  • §15.5 isolation (AC / High-Performance / DRR-off / foreground) was not enforced (automated run) → timing contaminated.
  • The §4.9-mandated randomized/interleaved ordering + CPU-clock telemetry is not wired into the harness.
  • The macro suite (L1–L14) can't run — Phase 4 deleted its projects (StressPerf.ReactorV2, BlankReactorV2).

A real ratification needs an isolated stable-AC re-capture + the macro suite rebuilt against the single Reactor variant. This PR is the data + analysis, not the gate close.
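For context on the ordering blocker, randomized/interleaved ordering generally means that the variants of a benchmark are not run in fixed back-to-back blocks, so slow drift (thermal throttling, for instance) is spread across variants rather than landing on one of them. The sketch below shows one way such a schedule could be generated. It is not the harness's design: the function name, the seed handling, and the variant list are assumptions.

```python
import random


def interleaved_schedule(bench_ids, variants, reps, seed=0):
    """Return a list of (rep, benchId, variant) runs.

    Within each rep every (bench, variant) pair appears exactly once, in an
    order shuffled independently per rep. A fixed seed keeps the schedule
    reproducible.
    """
    rng = random.Random(seed)
    schedule = []
    for rep in range(reps):
        block = [(rep, b, v) for b in bench_ids for v in variants]
        rng.shuffle(block)
        schedule.extend(block)
    return schedule


benches = [f"M{i}" for i in range(1, 14)]          # M1..M13
variants = ["Direct", "Reactor", "ReactorToday"]   # assumed variant names
runs = interleaved_schedule(benches, variants, reps=5)
print(len(runs))  # 195 = 13 benches x 3 variants x 5 reps
```

Shuffling within each rep, rather than across the whole run, keeps every rep complete, so a capture that is interrupted partway still yields balanced data for the reps that finished.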

🤖 Generated with Claude Code

Indicative M1–M13 capture on the §4.9 baseline box, ARM64-native/Release/.NET
10.0.8, reps=5, iters matched to the 2026-05-25 baseline (M1–M8 @5000, M9 @2000,
M10–M13 @1000). Adds raw JSONL, the aggregator-out tables, RESULTS.md (cross-
baseline comparison), and analyze.py.

Allocation (deterministic, valid — Direct alloc matches baseline byte-for-byte):
- §15.6 "M1–M3 alloc ≤ Today": M2 −5%, M3 −6% PASS; M1 +20% FAIL.
- §11.6 byte gates: M3 PASS; M1 (3.2×) and M2 (2.4×) over target.
- vs baseline ReactorV2: most benches flat/better (M9 −41% standout); M1 +20%
  and M12 +17% are real, deterministic regressions to investigate. Confirms the
  KD-3 trigger (M1 over budget).

NOT a ratification sign-off: §15.5 isolation (AC/High-Perf/DRR/foreground) was
not enforced, so the timing axis is environment-throttled (Direct ns +60–140% vs
baseline) and must be disregarded; the §4.9 randomized/interleaved ordering +
CPU-clock telemetry is not wired; and the macro suite (L1–L14) is unrunnable
(its projects were deleted in Phase 4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(benchId, variant), averaged across reps. Maps baseline 'ReactorV2' and current
'Reactor' to a common 'Reactor' label (same code lineage — the post-047 V1 path).
"""
import json, glob, sys, statistics, os
#465)

Updates the spec body + both trackers to reflect the indicative LAPTOP-4MEP83VI
capture now that the deterministic allocation axis is measured:
- §4.4/§11.6 byte gates MEASURED: M3 PASS; M1 (3.2×) + M2 (2.4×) FAIL.
- §15.6 "M1–M3 alloc ≤ Today": M2/M3 PASS, M1 +20.3% FAIL; M12 +17% regressed.
- KD-3 trigger CONFIRMED (M1 over budget) — fold warranted + investigate the
  bucketing regression.
- Gate stays OPEN: timing axis throttled (no §15.5 isolation), macro suite
  unrunnable (projects deleted in Phase 4); needs an isolated re-capture + the
  M1/M12 alloc fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codemonkeychris codemonkeychris merged commit b0c8174 into main May 30, 2026
24 of 25 checks passed
@codemonkeychris codemonkeychris deleted the spec-047-phase4-arm64-perf-capture branch May 30, 2026 07:44